Manifold Assumption and Defenses Against Adversarial Perturbations
Authors
Abstract
In the adversarial-perturbation problem of neural networks, an adversary starts with a neural network model F and a point x that F classifies correctly, and applies a small perturbation to x to produce another point x′ that F classifies incorrectly. In this paper, we propose taking into account the inherent confidence information produced by models when studying adversarial perturbations, where a natural measure of “confidence” is ‖F(x)‖∞ (i.e., how confident F is about its prediction). Motivated by a thought experiment based on the manifold assumption, we propose a “goodness property” of models, which states that the confident regions of a good model should be well separated. We give formalizations of this property and examine existing robust training objectives in view of them. Interestingly, we find that a recent objective by Madry et al. encourages training a model that satisfies our formal version of the goodness property well, but exerts only weak control over points that are misclassified but have low confidence. However, if Madry et al.’s model is indeed a good solution to their objective, then good and bad points become distinguishable, and we can try to embed uncertain points back into the closest confident region to get (hopefully) correct predictions. We therefore propose embedding objectives and algorithms, and perform an empirical study of this method. Our experimental results are encouraging: Madry et al.’s model wrapped with our embedding procedure achieves an almost perfect success rate in defending against attacks that the base model fails on, while retaining good generalization behavior.
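As a rough illustration of the two ingredients described in the abstract, the following Python sketch (using PyTorch) measures confidence as ‖F(x)‖∞, i.e. the largest softmax probability, and, for a low-confidence input, searches a small ℓ∞ ball around it for a nearby point where the model is confident before predicting. The classifier model, the radius eps, the threshold tau, and the gradient-ascent search are illustrative assumptions, not the paper's exact embedding objective.

import torch

def confidence(model, x):
    # ||F(x)||_inf: the largest softmax probability the model assigns to x.
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)
    return probs.max(dim=1).values

def embed_and_predict(model, x, eps=0.1, steps=20, step_size=0.02, tau=0.9):
    # If the model is already confident at x, keep its prediction as-is.
    if confidence(model, x).min() >= tau:
        with torch.no_grad():
            return model(x).argmax(dim=1)
    # Otherwise: projected gradient ascent on confidence inside an eps-ball
    # around x (a stand-in for "embedding x back to the closest confident
    # region"), then predict at the point found.
    z = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        conf = torch.softmax(model(z), dim=1).max(dim=1).values.sum()
        grad = torch.autograd.grad(conf, z)[0]
        with torch.no_grad():
            z = z + step_size * grad.sign()      # ascend on confidence
            z = x + (z - x).clamp(-eps, eps)     # project back into the L_inf ball
        z.requires_grad_(True)
    with torch.no_grad():
        return model(z).argmax(dim=1)

The gradient-ascent search here is only a placeholder for the embedding objectives and algorithms proposed in the paper; the essential point it shows is that the final prediction is read off at the re-embedded point rather than at the original uncertain input.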
Similar papers
On the Connection between Differential Privacy and Adversarial Robustness in Machine Learning
Adversarial examples in machine learning have been a topic of intense research interest, with attacks and defenses being developed in a tight back-and-forth. Most past defenses are best-effort, heuristic approaches that have all been shown to be vulnerable to sophisticated attacks. More recently, rigorous defenses that provide formal guarantees have emerged, but are hard to scale or generalize. ...
Machine vs Machine: Minimax-Optimal Defense Against Adversarial Examples
Recently, researchers have discovered that the state-of-the-art object classifiers can be fooled easily by small perturbations in the input unnoticeable to human eyes. It is known that an attacker can generate strong adversarial examples if she knows the classifier parameters. Conversely, a defender can robustify the classifier by retraining if she has the adversarial examples. The cat-and-mous...
Spatially Transformed Adversarial Examples
Recent studies show that widely used deep neural networks (DNNs) are vulnerable to carefully crafted adversarial examples. Many advanced algorithms have been proposed to generate adversarial examples by leveraging the Lp distance for penalizing perturbations. Researchers have explored different defense methods to defend against such adversarial attacks. While the effectiveness of Lp distance as...
Attack and Defense of Dynamic Analysis-Based, Adversarial Neural Malware Classification Models
Recently, researchers have proposed using deep learning-based systems for malware detection. Unfortunately, all deep learning classification systems are vulnerable to adversarial attacks where miscreants can avoid detection by the classification algorithm with very few perturbations of the input data. Previous work has studied adversarial attacks against static analysis-based malware classifiers ...
Certified Defenses against Adversarial Examples
While neural networks have achieved high accuracy on standard image classification benchmarks, their accuracy drops to nearly zero in the presence of small adversarial perturbations to test inputs. Defenses based on regularization and adversarial training have been proposed, but often followed by new, stronger attacks that defeat these defenses. Can we somehow end this arms race? In this work, ...
Journal: CoRR
Volume: abs/1711.08001
Issue: -
Pages: -
Publication year: 2017